
Refactor and add incremental workflow functionality. #78

Open · wants to merge 25 commits into main

Conversation

@haileyplusplus (Collaborator) commented Apr 1, 2024

Description

Add the ability to update data incrementally, refactoring the processing workflow to make this feasible.

This refactors the existing logic into multiple new classes and splits realtime and schedule processing
as much as possible. A few other changes are included:

  • Writes the updated GeoJSON file (data.json) to data_output/scratch/frontend_data_to_wk.json.
  • Loads CTA schedules from the public S3 bucket as well, since transitfeeds.com is no longer being updated.
  • Skips scraping transitfeeds.com entirely when its data isn't needed.
  • Caches data downloaded from S3 and other sources to save bandwidth (see the sketch after this list).
  • Adds command-line arguments to update_data.py.
  • Makes minor changes to the progress bar display for better readability.
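
For illustration, the caching behavior described above could look roughly like the hypothetical CacheManager below (a cache_manager object does appear in the traceback later in this thread, but this layout is a sketch, not the PR's actual API):

import hashlib
from pathlib import Path

import requests


class CacheManager:
    """Illustrative download cache: fetch each URL once, reuse the local copy."""

    def __init__(self, cache_dir: Path = Path("data_output/cache")):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def retrieve(self, url: str) -> Path:
        # Key the cache entry on a hash of the URL so any source (S3 or
        # otherwise) can share one cache directory.
        local_path = self.cache_dir / hashlib.sha256(url.encode()).hexdigest()
        if not local_path.exists():
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            local_path.write_bytes(resp.content)
        return local_path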

Resolves # [issue]

Type of change

  • Bug fix
  • New functionality
  • Documentation

How has this been tested?

Manually

@haileyplusplus (Collaborator, Author)

Some documentation polish is still needed.

@lauriemerrell (Member)

Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here.

Just wanted to respond to one note in the description for context:

> In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.

This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.
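
For illustration only, "just pick one" could mean resolving boundary dates deterministically to the newer feed, along these lines (the feed_start_date/feed_end_date field names here are hypothetical, not the project's actual schema):

from datetime import date


def pick_feed(feeds: list[dict], day: date) -> dict:
    # Feeds whose validity window covers this day; on a schedule version
    # boundary, two feeds can match.
    candidates = [
        f for f in feeds
        if f["feed_start_date"] <= day <= f["feed_end_date"]
    ]
    # Deterministically prefer the newer schedule version.
    return max(candidates, key=lambda f: f["feed_start_date"])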

@haileyplusplus (Collaborator, Author)

> Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here.
>
> Just wanted to respond to one note in the description for context:
>
> > In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.
>
> This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.

Thanks, I hadn't realized that. Good to know.

@lauriemerrell (Member)

Reviewer TODO to myself: check how the date boundary looks with the change in logic

@haileyplusplus (Collaborator, Author)

The download cache part of this is now in PR #80

@haileyplusplus (Collaborator, Author)

> > Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here.
> >
> > Just wanted to respond to one note in the description for context:
> >
> > > In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.
> >
> > This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.
>
> Thanks, I hadn't realized that. Good to know.

I decided to keep the existing boundary behavior here. I updated the PR so that the new code that pulls scraped schedules from S3 behaves consistently.

@haileyplusplus linked an issue on Apr 25, 2024 that may be closed by this pull request: Automate updates to JSON files
@haileyplusplus marked this pull request as ready for review April 25, 2024 23:09
@haileyplusplus (Collaborator, Author) commented Apr 25, 2024

Ready for review now! Comments welcome.

This is now basically a superset of PR #80, so if desired you could just review this.

@haileyplusplus (Collaborator, Author)

For easier reviewing, try
git diff origin/main --color-moved-ws=allow-indentation-change --color-moved=blocks
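
(--color-moved=blocks highlights runs of lines that were moved rather than added and removed, and --color-moved-ws=allow-indentation-change still counts lines as moved when only their indentation changed, which suits a refactor that relocates existing code into new classes.)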

@lauriemerrell (Member) left a comment


Thank you again for tackling this! A few questions... I was mostly testing by trying to run things locally. Is there an order that I need to follow?

Specifically, when I try to run update_data, I get the following error:

python3 -m update_data                     
INFO:root: Searching page 1
INFO:root: Searching page 2
INFO:root: Searching page 3
INFO:root: Searching page 4
INFO:root: Found schedule for May 2022
INFO:root: Adding schedule for May 7, 2022
INFO:root:Processing 49 schedules.
Traceback (most recent call last):
  File "/Users/laurie/opt/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/laurie/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/laurie/git/hailey-fork-ghost-buses/update_data.py", line 394, in <module>
    main()
  File "/Users/laurie/git/hailey-fork-ghost-buses/update_data.py", line 367, in main
    combined_long_df, summary_df = csrt.main(cache_manager, freq=freq, start_date=start_date, end_date=None,
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 296, in main
    return summarizer.main(existing)
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 282, in main
    this_iter = combiner.combine()
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 142, in combine
    schedule = self.schedule_summarizer.get_route_daily_summary()
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 109, in get_route_daily_summary
    trip_summary = self.make_trip_summary()
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 127, in make_trip_summary
    data = self.download_and_extract()
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 241, in download_and_extract
    cta_gtfs = zipfile.ZipFile(self.gtfs_fetcher.retrieve_file(self.schedule_feed_info))
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/gtfs_fetcher.py", line 103, in retrieve_file
    local_filename, _, s3_filename, _ = self.versions[version_id]
KeyError: '20220507'

I can't tell whether I was supposed to run something first to initialize whatever is causing the KeyError?

Resolved review threads on: data_analysis/gtfs_fetcher.py (×2), data_analysis/schedule_manager.py (×2), data_analysis/compare_scheduled_and_rt.py, utils/s3_csv_reader.py, update_data.py
@haileyplusplus (Collaborator, Author)

> Thank you again for tackling this! A few questions... I was mostly testing by trying to run things locally. Is there an order that I need to follow?
>
> Specifically, when I try to run update_data, I get the following error:
>
> ...
>
> I can't tell whether I was supposed to run something first to initialize whatever is causing the KeyError?

update_data without arguments is broken for me too. I will work on fixing it. In the meantime, you can test the incremental workflow by pointing to the existing frontend status file with something like this:

python3 -m update_data --update ../ghost-buses-frontend/src/Routes/schedule_vs_realtime_all_day_types_routes.json
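
(As a sketch of the flag's shape only, not the PR's actual parser, --update could be wired up with argparse roughly like this:)

import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Update ghost-buses data.")
    # Hypothetical wiring: path to an existing frontend status file to
    # update incrementally instead of rebuilding from scratch.
    parser.add_argument(
        "--update",
        metavar="EXISTING_JSON",
        help="Path to an existing frontend status file to update incrementally.",
    )
    return parser.parse_args()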

@haileyplusplus (Collaborator, Author)

> > Thank you again for tackling this! A few questions... I was mostly testing by trying to run things locally. Is there an order that I need to follow?
> >
> > Specifically, when I try to run update_data, I get the following error:
> >
> > ...
> >
> > I can't tell whether I was supposed to run something first to initialize whatever is causing the KeyError?
>
> update_data without arguments is broken for me too. I will work on fixing it. In the meantime, you can test the incremental workflow by pointing to the existing frontend status file with something like this:
>
> python3 -m update_data --update ../ghost-buses-frontend/src/Routes/schedule_vs_realtime_all_day_types_routes.json

It was a simple fix for the non-arguments version, which I just committed. It should work now.

@haileyplusplus (Collaborator, Author)

Ok, I've updated the readme and addressed the open comments. I think this is ready for another round of testing and review. Please let me know what you think about deferring the transitfeeds changes.
